# Multimodal processing

Gemma 3n E2B It Unsloth Bnb 4bit
Gemma 3n-E2B-it is a lightweight open-source multimodal model launched by Google, built on the same technology as Gemini and optimized for low-resource devices.
Image-to-Text Transformers English
unsloth
4,914
2
Gemma 3n E2B
Gemma 3n is a lightweight, state-of-the-art open-source model family launched by Google, supporting multimodal input and output.
Image-to-Text Transformers
google
206
11
Gemma 3n E4B It
Gemma 3n is a lightweight and state-of-the-art open-source multimodal model family launched by Google. It is built on the same research and technology as the Gemini model and supports text, audio, and visual inputs.
Image-to-Text Transformers
google
1,690
81
Nuextract 2.0 4B
MIT
NuExtract 2.0 is a series of multimodal models specifically trained for structured information extraction tasks. It supports text and image inputs and has multilingual processing capabilities.
Image-to-Text Transformers
numind
272
3
Google.gemma 3 4b It Qat Int4 Unquantized GGUF
A quantized version of the image-to-text model based on Gemma 3 4B, aiming to make knowledge accessible to the public
Image-to-Text
DevQuasar
161
1
Gemma 3 4b It Qat Autoawq
Gemma 3 is a lightweight open-source multimodal model launched by Google, built on Gemini technology, supporting text and image input and generating text output.
Image-to-Text Safetensors
gaunernst
503
1
Smoldocling 256M Preview Mlx Fp16
Apache-2.0
This model is converted from ds4sd/SmolDocling-256M-preview to the MLX format, supporting image-text-to-text tasks.
Image-to-Text Transformers English
ahishamm
24
1
Gemma 3 27b Pt Bnb 4bit
Gemma 3 is a lightweight open model series launched by Google, built on the same research and technology as the Gemini model, supporting multimodal input and text output.
Image-to-Text Transformers English
unsloth
2,009
1
Gemma 3 1b Pt Unsloth Bnb 4bit
Gemma 3 is a series of lightweight open models launched by Google, supporting multimodal input (text and images), with a 128K large context window, suitable for various tasks such as question answering and summarization.
Image-to-Text Transformers English
unsloth
4,481
3
Kaleidoscope Large V1
A specialized document Q&A model fine-tuned from sberbank-ai/ruBert-large, supporting Russian and English document Q&A tasks.
Question Answering System Transformers Supports Multiple Languages
2KKLabs
214
2
Kaleidoscope Large V1
A document QA model fine-tuned from sberbank-ai/ruBert-large, excelling at extracting answers from documents, supporting Russian and English.
Question Answering System Transformers Supports Multiple Languages
LaciaStudio
297
0
Kaleidoscope Small V1
A document question-answering model fine-tuned from sberbank-ai/ruBert-base, excelling at extracting answers from document contexts, supporting Russian and English.
Question Answering System Transformers Supports Multiple Languages
2KKLabs
98
0
Ola Image
Apache-2.0
Ola-7B is a multimodal language model jointly developed by Tencent, Tsinghua University, and Nanyang Technological University, based on the Qwen2.5 architecture. It supports processing image, video, audio, and text inputs and outputs text.
Multimodal Fusion Safetensors Supports Multiple Languages
THUdyh
61
3
Mineru
Apache-2.0
This model converts PDF documents into Markdown format while preserving the original document layout structure and accurately recognizing mathematical formulas and tables.
Image-to-Text Transformers Supports Multiple Languages
kitjesen
122
12
Pixtral 12b Nf4
Apache-2.0
A 4-bit quantized version of the Mistral community's Pixtral-12B, focusing on image-text-to-text tasks and supporting Chinese description generation.
Image-to-Text Transformers
SeanScripts
236
20
Florence 2 DocVQA
This is a version of Microsoft's Florence-2 model fine-tuned for one day on 5% of the Docmatix dataset, with a learning rate of 1e-6.
Text-to-Image Transformers
HuggingFaceM4
3,096
60
Kosmos 2 PokemonCards Trl Merged
This is a multimodal model fine-tuned from Microsoft's Kosmos-2, specifically designed for recognizing Pokemon names on Pokemon cards.
Image-to-Text Transformers English
Mit1208
51
1
Cellseg Sribd
Apache-2.0
Cell segmentation model developed by Sribd-med team, suitable for cell instance segmentation tasks in multimodal images
Image Segmentation Transformers English
Lewislou
23
0
Donut Base Finetuned Latvian Receipts V2
MIT
A model based on the Donut architecture, specifically fine-tuned for Latvian receipt data
Text Recognition Transformers
Inesence
13
0
S2t Small Mustc En De St
MIT
A speech-to-text transformer model trained for end-to-end English-to-German speech translation
Speech Recognition Transformers Supports Multiple Languages
facebook
156
0
S2t Small Mustc En Ro St
MIT
A Transformer-based end-to-end speech translation model designed for English to Romanian speech translation
Speech Recognition Transformers Supports Multiple Languages
facebook
19
0
S2t Small Mustc En Fr St
MIT
End-to-end English-to-French speech translation model based on S2T architecture, trained on the MuST-C dataset
Speech Recognition Transformers Supports Multiple Languages
facebook
2,326
2